The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon

Authors

  • Barbara McGillivray
  • Marco Passarotti
  • Paolo Ruffolo
Abstract

We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the texts of Thomas Aquinas. We briefly describe the annotation guidelines, which are shared with the Latin Dependency Treebank (LDT). We report on the application of data-driven dependency parsers to IT-TB and LDT data, present training and parsing results on several datasets, and evaluate learning algorithms and techniques. Furthermore, we introduce the IT-TB valency lexicon extracted from the treebank, report quantitative data on the lexicon, and provide some statistical measures of subcategorisation structures.

Similar resources

The Development of the "Index Thomisticus" Treebank Valency Lexicon

We present a valency lexicon for Latin verbs extracted from the Index Thomisticus Treebank, a syntactically annotated corpus of Medieval Latin texts by Thomas Aquinas. In our corpus-based approach, the lexicon reflects the empirical evidence of the source data. Verbal arguments are induced directly from annotated data. The lexicon contains 432 Latin verbs with 270 valency frames. The lexicon is...

Selectional Preferences from a Latin Treebank

We present a system for automatically acquiring selectional preferences for Latin verbs. We use the Index Thomisticus Treebank Valency Lexicon and an enriched version of Latin WordNet as the reference conceptual hierarchy.

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of co...

Building a Bilingual ValLex Using Treebank Token Alignment: First Observations

In this paper we explore the potential and limitations of a concept of building a bilingual valency lexicon based on the alignment of nodes in a parallel treebank. Our aim is to build an electronic Czech↔English Valency Lexicon by collecting equivalences from bilingual treebank data and storing them in two already existing electronic valency lexicons, PDT-VALLEX and Engvallex. For this task a s...

Journal:
  • TAL

Volume 50

Publication date: 2009